Introduction

Hi! Please work through this short intro before the workshop, especially the package installation part. Work until the line which says This is the end of the intro; we’ll continue from there together. If you come prepared, you’ll be able to follow the workshop easily; if you don’t work through this short intro, your experience will not be very good.

This R Markdown worksheet contains both introductions and code blocks. As a quick overview, the intro contains:

  1. Instructions on how to change a necessary setting if you haven’t yet.
  2. A very short and simple exercise to get you familiar with basic R.
  3. A code block that will install the required packages - please don’t forget to do that. If you don’t install the packages, you won’t be able to do any of the exercises in the workshop. You might have noticed a message above the script pane saying “Packages required but are not installed” - we’ll get to that in a second. But first:

Configure these RStudio options, to make your life about 200% easier

A quick pre-workshop exercise for beginners

If all of this so far looks and feels new to you, then this here’s something I’d recommend doing before the workshop to get a feel for R coding, so we can move on to the more interesting stuff quicker together. If you’ve used R before, feel free to skip this and go straight to the package installation part (definitely do the installation part!). Also notice how this line here is quite long and without manual line breaks - if you didn’t change the Soft Wrap setting above - you’d have to scroll left and right and left and right to be able to read it. If you successfully changed that one setting above, then the text should be nicely wrapped within the script pane, no back and forth scrolling required.

So far you’ve seen only free-running text. The shaded block below is a code block. RStudio colors it differently than the rest of the document:

# This is a comment, and below it is a line of code:
print( "Hello! Put your text cursor on this line (click on the line). Anywhere on the line. Now press CTRL+ENTER (PC) or CMD+ENTER (Mac). Just do it." )
## [1] "Hello! Put your text cursor on this line (click on the line). Anywhere on the line. Now press CTRL+ENTER (PC) or CMD+ENTER (Mac). Just do it."
# The command above, when executed (what you just did), printed the text in the *console pane* below. Also, this here is a comment - comments start with a # hashtag.
# Commented parts of the script (anything after a # ) are not executed. Feel free to add your own comments anywhere inside the code blocks.
# This R Markdown file has both code blocks (gray background in the default theme) and regular text (white background).
# Code blocks start and end with the 3 ``` symbols; make sure you don't delete them.
# Always write code inside code blocks.

Everything outside the code blocks is just regular text. Feel free to also add your own notes anywhere, including right here (no #comment symbol required). Now let’s try some more simple functions.

sum(1,10) # cursor on the line, press CTRL+ENTER (or CMD+ENTER on Mac)
# You should see the output (sum of 1 and 10) in the console. 
# Important: you can always get help for a function and check its input parameters by executing 
help(sum)  # put the name of any function in the brackets
# ...or by searching for the function by name in the Help tab on the right.

# Exercise. You can also write commands directly in the console, and execute them with ENTER. Try some more simple maths - math in R can also be written using regular math symbols (which are really also functions). Write 2*3+1 in the console below manually, and press ENTER. You can also go back to previous commands by pressing the up arrow on your keyboard. 
# Don't skip this, do it.


# Now let's plot something. The command for basic plotting is, surprisingly, plot().
plot(x = 42, main = "The greatest plot in the world") # execute the command; a plot should appear on the right.
# So that was not very exciting. But notice that a function can have multiple inputs, or arguments. In this case, the first argument is the data for the x axis (here a vector of length one), and the second is 'main', which specifies the main title of the plot. 
# You can make the plot pop out in a bigger window by pressing the 'Zoom' button above the plot panel on the right.

# Let's plot some 100 random numbers, generated with the rnorm() function.
hist(x=rnorm(100))               # a histogram
plot(x=rnorm(100), y=rnorm(100)) # a scatterplot

# Note that in R, extra spaces don't matter in terms of syntax, and neither do line breaks inside the brackets of a function call, so this gives the same result:
plot(x= rnorm(100) , 
     y = rnorm( 100)
     )    

# By the way, you can also always search in the script file using CTRL+F (CMD+F on a Mac).
# If you make a mistake in the script, you can always Undo it (CTRL+Z; or CMD+Z on a Mac).
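
Two points from the comments above (math symbols are really functions, and arguments can be named) can be tried out directly. This is only a minimal base R sketch; the object names a, b and v are made up for illustration.

```r
# Operators are ordinary functions: both lines compute 2*3+1
a = 2 * 3 + 1
b = `+`(`*`(2, 3), 1)   # the same calculation written as function calls
a == b                  # TRUE

# Named arguments can be given in any order; unnamed ones are matched by position
v = c(5, NA, 10)        # NA marks a missing value
sum(v, na.rm = TRUE)    # 15: na.rm tells sum() to skip missing values
sum(na.rm = TRUE, v)    # same result, the named argument just comes first
```

Writing math with backticks is never necessary in practice; it is shown here only to make the "operators are functions" remark concrete.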

These basic plots don’t look exactly amazing. I know - that’s why we’ll be using packages like ggplot2 and plotly in the workshop that make it easy to produce beautiful colorful wonderful graphs, and we’ll see how to make them interactive too.

# Let's try another thing. We'll use the paste() command, which concatenates (glues) strings (words and such) into a single string.
paste("hello", "world") # two inputs, outputs single string

# Most functions follow this pattern: there's input(s) and maybe some parameters, separated by commas, something is done to the input, and then there's an output. Here the "input" is the two strings "hello" and "world".
# If an output is not assigned to some object, it usually just gets printed in the console. It would be easier to work with data if we saved it in an object. For this, we need to learn assignment, which in R works using the equals = symbol.
sentence = paste("Hello", "world!")  
# what it means: "sentence" is the arbitrary name of a (new) object, the equals sign = signifies assignment, with the object on the left and the data on the right 
# (note that there are two ways of doing assignment, to define objects in R: either with = or <- ; we'll be using the = here). 

# In this case, the "data" is the output of the paste() function. Instead of printing in the console, the output is assigned to the object.
sentence # run this line to inspect: calling an object usually prints its contents into the console below. Try it.

# Let's try assignment one more time; let's create an object with your name. 
myname = " "    # put your name between the quotes, like this: myname = "Andres"
# Now run this line: note that the output will depend on what you assigned to myname:
paste("Hello", myname, "you're doing great!")  
# This works because paste() concatenates the strings with the value of the object myname.
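
If you are curious, paste() has a couple of options that control how the pieces are glued together. The sketch below uses only base R; the object names are arbitrary.

```r
# Assignment with = and <- creates the same kind of object
greeting1 = paste("Hello", "world!")
greeting2 <- paste("Hello", "world!")
identical(greeting1, greeting2)        # TRUE

# sep controls what goes between the pieces (a single space by default)
paste("Hello", "world", sep = "-")     # "Hello-world"
paste0("Hello", "world")               # "Helloworld": paste0() uses no separator

# collapse glues the elements of ONE vector into a single string
words = c("you're", "doing", "great!")
paste(words, collapse = " ")           # "you're doing great!"
```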

That’s the end of the beginners exercise block. Now make sure to do the installation part below, and then you’re all done!

Install the packages now

Let’s get to installing. Think of packages like boxes of ready-made code that somebody else has written, that you won’t have to write and can just use. If you get an error while installing, see the troubleshooter below. RStudio now also notifies you of missing packages on top of the script pane.

Run the block below - this will take some time, so while it’s doing its thing, go back to the course webpage and read through section 4, the Code Troubleshooting bit (also please don’t forget to put your name in the confirmation form there, once package installation finishes). In some cases R will ask for your further input to continue with installation (if so, see the steps below). It will most likely ask you if you’d like to install packages from source - if so, click No (or type no if the prompt appears in the Console).

# Run this code block; it should start throwing messages about installing a bunch of stuff in the console. This will take some time but only needs to be done once!
p=c("tidyverse","ggbeeswarm", "patchwork", "shadowtext", "quanteda", "quanteda.textplots", "quanteda.textstats", "plotly", "rworldmap", "maps", "gapminder", "visNetwork", "gsbm", "igraph", "rayshader");install.packages(p);x=p%in%rownames(installed.packages());if(all(x)){print("All packages installed successfully!")}else{print(paste0("Failed to install: ", paste(p[!x], collapse=", "), "; try again and make sure you have an internet connection."))}
# If it asks "Do you want to install from sources the package which needs compilation?" just go for "no".
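
For the curious: the check at the end of that long line can be written out more readably. The sketch below does not install anything; the helper find_missing() and the deliberately fake package name are made up for illustration, and only the checking logic mirrors the block above.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
find_missing = function(pkgs) {
  installed = rownames(installed.packages())  # names of all installed packages
  pkgs[!(pkgs %in% installed)]                # keep only the ones not found
}

# "stats" ships with R; the second name is deliberately fake
wanted = c("stats", "notARealPackage123")
not_installed = find_missing(wanted)

if (length(not_installed) == 0) {
  print("All packages installed successfully!")
} else {
  # collapse = ", " joins several names into one readable message
  print(paste("Not installed:", paste(not_installed, collapse = ", ")))
}
```

In the real block, install.packages(p) runs first and the check only reports what is still missing afterwards.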

Troubleshooting: if you get an error or confusing prompt when running the above, see here; otherwise don’t mind this part:




This is the end of the intro.

Thanks for installing everything and preparing your RStudio so that we can go straight to the interesting stuff in the workshop and start learning useful things together.

Stop here for now, and we’ll continue together in the workshop. Please don’t forget to sign your name into the confirmation form on the installation instruction page though!




(P.S. If you’re tempted to scroll onward and check out the materials beforehand, feel free, but also don’t be alarmed by the fairly large number of exercises under each section - this is so that if some people already know some R, they can do those extra ones instead of being bored out of their minds. It’s totally ok if you take it slow and only do the first couple exercises under each section during the workshop. But also whenever you feel like you’re falling behind, raise a hand.)

---

Welcome to the workshop!

This is where we start together in the live workshop. If you have a little message on top of this window saying packages are missing, please click it or go back to the package installation section to install ASAP.

I also set up a troubleshooting clipboard: if needed, describe your error there and then ask for assistance over the chat: https://hackmd.io/@andreskarjus/HyaRgdbxY/edit

Load the packages

Let’s start by loading the required packages. I’ll put this right here in the beginning so you won’t miss it. Throughout this worksheet, you’ll see a number of functions and also a few datasets from these packages. If you run the code and don’t get any errors in the console, and a plot saying Welcome! appears on the right, in the Plots pane, then package loading probably worked. Yay!

# Load the necessary packages - this needs to be done every time you restart R
# To run the entire code block here, click the little green triangle > in the top right corner of the code block. Do that now.
# Or put your cursor on the first line of the code and press CTRL+ENTER (CMD+ENTER)

suppressWarnings(suppressMessages({  # -> Run this! (it might take a moment)
  library(tidyverse)          # includes ggplot2, dplyr, tibble
  library(gapminder)          # provides a dataset we'll need
  library(plotly)             # interactive plots
  library(quanteda)           # a corpus pkg & its addons
  library(quanteda.textplots) #
  library(rworldmap)          # package for maps
  ggplot(gapminder)+annotate("text",x=0,y=0,label="Welcome!") # a little test
}))
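
Installing a package is done once, but attaching it with library() is needed in every new R session. If you are ever unsure whether the block above has been run, one possible way to check is sketched below in base R; the object names are arbitrary.

```r
# Packages we expect to have attached with library() in the block above
pkgs = c("tidyverse", "gapminder", "plotly", "quanteda", "quanteda.textplots", "rworldmap")

# search() lists what is currently attached; packages appear there as "package:name"
attached = paste0("package:", pkgs) %in% search()
names(attached) = pkgs
attached   # TRUE for every package that library() has attached in this session
```

Any FALSE in the result means the loading block should be run again (or the package is not installed yet).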


Basic data operations and plots

Let’s get down to business.

# We will be using the gapminder dataset from the `gapminder` package that we loaded above
library(gapminder) # this call is just here to remind you that we're using this package

# We can inspect the data using convenient R commands.
class(gapminder)    # type of the object: it's a "tibble", a kind of dataframe (I know, this probably doesn't help much right now. It's a table of sorts:)
dim(gapminder)      # dimensions of that table
summary(gapminder)  # produces an automatic summary of the columns
head(gapminder)     # prints the first rows

# In RStudio, you can also have a look at dataframe type objects by clicking on the little "table" icon next to it in the Environment section (top right), or by running this command:
View(gapminder) # this will open a new tab next to your script tab.

# help(gapminder)     # built in datasets often also have help files attached; this one is quite helpful - go have a look what the variables actually stand for, before moving on.


# Accessing values in the dataset
gapminder[1:6, ]  # a slice of first 6 rows; the syntax: [rows, columns] 
gapminder[1:6 , c("country", "pop", "year")]  # select 3 columns and first 6 rows
# This is how base R works; we can use the Tidyverse package dplyr to do this in a more transparent way:
gapminder %>%     # the pipe operator
  slice(1:6) %>%  # slice of rows 1 to 6
  select(country, pop, year)  # select these columns

# Basic plotting:
hist(gapminder$lifeExp, breaks=10)
boxplot(gapminder$lifeExp, ylab="Life expectancy")
plot(pop ~ lifeExp, data=gapminder)
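
The [rows, columns] syntax and the $ sign used above work the same way on any data frame. Here is a small stand-alone sketch using iris, a dataset that ships with R, so it runs even without the workshop packages.

```r
# iris is a small built-in dataset (150 rows, 5 columns)
dim(iris)                                   # 150 5

iris[1:3, ]                                 # rows 1 to 3, all columns
iris[1:3, c("Species", "Sepal.Length")]     # rows 1 to 3, two columns by name
iris$Sepal.Length[1:3]                      # $ pulls out one column as a vector

# A logical condition in the row position keeps only the matching rows
setosa = iris[iris$Species == "setosa", ]
nrow(setosa)                                # 50
```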

While the base plots work just fine in R, you might have noticed the syntax is not the most straightforward, nor are the default looks particularly appealing. We will therefore look into using a better plotting package below instead.


ggplot2

We’ll now switch to an alternative plotting package, ggplot2. It uses a different approach to plotting, and a slightly different syntax. It also comes with default colors and aesthetics which many people find nicer than those of the base plot(). A particularly useful feature of ggplot2 is its extensibility (or rather the fact people are eager to extend it), with an ever-growing list of add-on packages on CRAN with an extended selection of themes and more niche visualization methods. It’s also easy to make ggplots interactive using the plotly package (examples later).

Let’s build our first ggplot up layer by layer:

library(gapminder)
library(ggplot2) 
library(dplyr)
# we actually already loaded these; ggplot2 and dplyr are part of tidyverse
# but I included package loading calls in code blocks where a new package is introduced - so you can keep track which new functions come from which package.


# You can think of ggplot as putting layers of different elements on a canvas:
ggplot()  # 1. The "empty canvas"; calling this just plots an empty plot

ggplot(data=gapminder)  # 2. let's add the data argument (but it's still an empty plot, as we have not asked for any variables to be shown)

ggplot(data=gapminder, mapping=aes(x=lifeExp, y=pop)) # 3. Define the x and y axes using aes() 

# (which stands for "aesthetic mappings") - this makes something show up.

# 4. Now let's add a "geom" - these are layers that plot things like points or lines.
# Layers are "added" using the + operator
# But also they only work if you use the +, just putting the lines of code under one another won't be enough.
ggplot(gapminder, aes(x=lifeExp, y=pop))+  # notice the plus
  geom_point()   

# If provided in the correct order, parameter names can be omitted; so I won't be spelling out the data= part here anymore.

# 4.1. I also want you to get into the habit of putting a NULL in the end of ggplot code blocks: it doesn't do anything (being null), but this extra bit will save you from a certain fairly confusing error which happens if you would accidentally leave a trailing + in the end of a ggplot call. Make sure the NULL is always in the very very end, and never put a "+" after the NULL.
ggplot(gapminder, aes(x=lifeExp, y=pop))+
  geom_point()+
  NULL

# Sanity check: what are we actually looking at? What are the variables and their relationship?


# 5. We can add parameters to the geoms, like color, fill, size, alpha/transparency:
ggplot(gapminder, aes(x=lifeExp, y=pop))+
  geom_point(color="purple", size=2, alpha=0.2)+  
  NULL

# "purple" is a string (a color name), so use quotes


# 6. Let's color the points conditionally by continent instead: 
# for that, define the coloring variable in the aes() function:
ggplot(gapminder, aes(x=lifeExp, y=pop, color=continent))+
  geom_point()+ 
  NULL

# Defining variables for color, fill, size and shape automatically generates a legend.

# So an explicit color like "blue" or "purple" or size=1.2 goes into the geom_point() options; but if you want to color a geom by the values of a variable, do that in the aes(). So aes(color="blue") won't give you blue points (it treats "blue" as a data value, not as a color name), but aes(color=continent) works. Only variable names go in the aes(). 
# We'll also see below how to change the colors of a conditional coloring palette.


# 7. The order of things in a ggplot block matters:
ggplot(gapminder, aes(x=lifeExp, y=pop, color=continent))+
  geom_point(color="red", size=2, shape=15)+ 
  geom_point(color="blue", size=0.5)+ 
  NULL

# The little blue circles are plotted on top of the bigger red squares, because the blue circles line of code is after the red points line.

# 8. Let's also try adding a different "theme", a more minimal one; it will replace the default.
ggplot(gapminder, aes(x=lifeExp, y=pop, color=continent))+
  geom_point()+
  theme_minimal()+                # add before theme()
  theme(legend.position = "top")+ # add after theme_* if one is present
  NULL

# Again, order matters: if using the theme-modifying function theme(), it needs to go after the line that defines the overall theme.
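
The coloring rule and the theme ordering rule from the block above can also be checked without looking at the plots. This is only a sketch: it uses the built-in iris data instead of gapminder so that it runs on its own, and it peeks at the theme stored inside the plot object, which is an inspection trick rather than something you need in practice.

```r
library(ggplot2)

# A fixed color goes in the geom; a variable goes in aes()
p_fixed = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue") +      # every point blue, no legend
  NULL
p_mapped = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +                    # one color per species, legend added
  NULL

# theme() after theme_minimal(): the legend setting is kept
p_ok  = p_mapped + theme_minimal() + theme(legend.position = "top")
# theme() before theme_minimal(): the complete theme overwrites the setting
p_bad = p_mapped + theme(legend.position = "top") + theme_minimal()

p_ok$theme$legend.position    # "top"
p_bad$theme$legend.position   # the theme's own default again, not "top"
```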

All ggplot2 functions are just like other R functions: they have a name (such as geom_point) and parameters (color="blue", size=2), which are separated with commas and surrounded by brackets: geom_point(color="blue", size=2). The only difference is that the functions are joined into a block using the + operator. Pressing the tab key while inside the function brackets pops up the available parameter list.
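
Because each piece is an ordinary function call joined on with +, a plot can also be built in steps and stored in objects along the way. A minimal sketch on the built-in iris data; counting the entries in `$layers` is just one way to peek inside the plot object, and the object names are made up.

```r
library(ggplot2)

canvas      = ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))  # data and axes, no geoms yet
with_points = canvas + geom_point(color = "blue", size = 2)         # + adds one layer
with_null   = with_points + NULL                                    # adding NULL changes nothing

length(canvas$layers)       # 0
length(with_points$layers)  # 1
length(with_null$layers)    # 1
```

This is also why the trailing NULL is harmless: it is simply one more thing added with +, and it adds nothing.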

Exercise

This is the same dataset as above, but we’re using a subset of the data, just Europe for now (subsetted using the filter() function from the dplyr package). You’ll notice that some of the exercises have multiple possible solutions - feel free to experiment!

Do the following exercises one by one: do the requested addition or change, and then run the code to see how it looks. If you get an error, it’s probably something in the new piece of code you just added or changed. Make sure to only write code inside code blocks, not out here.

Important: when you’re done with the last exercise (or the last exercise you finished when the instructor told the class to start uploading solutions), take a screenshot of your plot or use the Export > Copy to clipboard option above the plot pane, then paste your result here on this common clipboard: https://hackmd.io/@andreskarjus/SkTrzBZgK/edit

If you get an error you can’t solve, raise a hand and ask for help. I also set up a troubleshooting clipboard: describe your error there and then ask for assistance over the chat: https://hackmd.io/@andreskarjus/HyaRgdbxY/edit

Recap: plots go here: https://hackmd.io/@andreskarjus/SkTrzBZgK/edit errors go here: https://hackmd.io/@andreskarjus/HyaRgdbxY/edit

The exercises:

  1. We already tried coloring by continent. Now explore the data by coloring by country or year, and try to interpret the plot that comes out (discuss with your neighbour if you’re sitting together with somebody). Remember, explicit colors such as “blue” are defined in the geom, while conditional coloring (such as “color by country”) is defined in the aes() or aesthetic mapping function.
ggplot(data = gapminder %>% filter(continent=="Europe"), 
       mapping = aes(
         x=lifeExp, # life expectancy on the x axis
         y=pop,     # population on the y axis
         # here: define which variable should be mapped to color:
         
         #
         )) + 
  # add geoms, scales and themes between here...
  geom_point(  )+   # modify the points by adding parameters between the ( )

  # ...and here
  NULL  # while keeping this in the end

Extra exercise if you’re fast:

  - If coloring by year, try using scale_colour_viridis_c() instead (scales are added with a + just like geoms).
  - There’s a lot of points there - either make them a bit smaller by adding size=0.7 into geom_point(), or make them transparent (alpha=0.7), or change the point shape to an empty circle (shape=1), or all of these options.
  - It’s easy to change axis labels too: add the labs(x="", y="", title="") layer and specify the titles in the quotes.
  - Try removing or moving the legend using + theme(), specifying the legend.position parameter with value “none”, “bottom”, etc (with quotes). Make sure the theme() layer comes after theme_bw(), theme_minimal() etc, as theme() modifies the options of those.

Even more extra exercises if you’re super duper fast:

  - Try replacing points with text: instead of points, use geom_text(aes(label=country), size=2, hjust=1) here (the hjust argument makes the labels right-aligned, which looks nicer than the default center-alignment).
  - Probably doesn’t make sense to color by country any more, so color by year instead.
  - Extra: you can also try removing the points by deleting or commenting out the geom_point() line, or making the points either smaller or more transparent.
  - Extra: try changing the default theme for another one like theme_bw() (themes are added with a + like other layers).
  - Try labeling with geom_text(), but so that only the last year (2007) is labelled, by defining the data argument separately in the geom_text() call, and setting its value to a subset like this: data=gapminder %>% filter(continent=="Europe", year==2007)
  - Or use shorter labels: a quick way to do it is to use the substring function, which you can use right inside the aes() in geom_text() like this: aes(label=substr(country,1,3)) - you can now either remove the points or just make them smaller (e.g. size=0.2).
  - Try more themes and color scales.

Finally, copy-paste your plot to https://hackmd.io/@andreskarjus/SkTrzBZgK/edit

Solution (don’t look here yet;)

# These are some possible solutions to the exercises above

# 1.
ggplot(data = gapminder %>% filter(continent=="Europe"), 
       mapping = aes(
         x=lifeExp, # life expectancy on the x axis
         y=pop,     # population on the y axis
         color=year
         )) + 
  geom_point(size=0.7, alpha=0.7)+  
  scale_colour_viridis_c()+
  labs(x="Life expectancy", 
       y="Population size", 
       title="Europe 1952-2007")+
  theme_bw()+
  NULL

# 2.
ggplot(data = gapminder %>% filter(continent=="Europe"), 
       mapping = aes(
         x=lifeExp, # life expectancy on the x axis
         y=pop,     # population on the y axis
         color=country
         )) + 
  geom_point(size=0.2)+  
  geom_text(aes(label=substr(country,1,3)), size=3, hjust=1)+
  
  labs(x="Life expectancy", 
       y="Population size", 
       title="Europe 1952-2007")+
  theme_bw()+
  NULL

Exercise 2

Let’s try doing a quick time series plot as well.

# it's basically all the same, just this time we use a different geom for lines. Let's also try a new continent.


ggplot(data = gapminder %>% filter(continent=="Oceania"), 
       mapping = aes(
         x=year,
         y=lifeExp,
         color=country
         )) + 
  # add geoms, scales and themes between here...
  geom_line()+   # New!

  theme(legend.position = "none")+
  # ...and here
  NULL  # while keeping this in the end

So this plot gives life expectancy over the years; I’ve hidden the legend because there’s way too many countries, the legend would be huge. Let’s add country labels directly on the plot instead:

  - geom_text(aes(label=country), data=gapminder %>% filter(continent=="Oceania", year==2007), hjust=0, size=2)+ - that adds labels, but because we filter the data here once more, it labels only the last year.
  - The labels are hard to see though! Let’s give them some space; add this: scale_x_continuous(expand = expansion(mult = c(0, 0.2)))+
  - Extra: you can also add points on top of the lines - try adding the geom_point() back in.
  - And the axis labels could be nicer, fix them using labs().
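
To get a feel for what expansion(mult = c(0, 0.2)) asks for, the arithmetic can be done by hand. This is a rough sketch, assuming the x axis covers the years 1952 to 2007 as in this dataset; the variable names are made up and the calculation only imitates what the scale does internally.

```r
x_range = c(1952, 2007)        # first and last year on the x axis
mult    = c(0, 0.2)            # expand by 0% on the left, 20% on the right

span     = diff(x_range)       # width of the data range: 55 years
expanded = c(x_range[1] - mult[1] * span,
             x_range[2] + mult[2] * span)
expanded                       # 1952 2018: 11 extra "years" of room for the labels
```

A bigger second value in mult gives the labels more room on the right; the first value does the same for the left side.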

When done, paste your plot to https://hackmd.io/@andreskarjus/SkTrzBZgK/edit


Quick sneak peek: interactive plots

library(plotly)

# save the plot as an object (that's what the assignment operator "=" does)
g = ggplot(data = gapminder %>% filter(continent=="Oceania"), 
       mapping = aes(x=year, y=lifeExp, color=country)) + 
  geom_line()+
  geom_point()+
  geom_text(aes(label=country), data=gapminder %>% filter(continent=="Oceania", year==2007), hjust=0, size=2)+
  scale_x_continuous(expand = expansion(mult = c(0, 0.2)))+ 
  theme(legend.position = "none")+
  # ...and here
  NULL  # while keeping this in the end
g # take a look

ggplotly(g) # run this; hover your cursor over the plot that appears.

The Latvian publishing dataset

For this workshop, we’ve been provided a books and comics publication dataset by the summer school organizers. We’re going to download the data file from GitHub below.

library(readr) # part of tidyverse
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt") # uses the readr package to both download and read the file
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read.table(text=readLines("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt", n = 10, encoding = "UTF-8"), sep="\t") # debug option using base R, if the above doesn't work
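
The column specification printed above shows which columns were read as text (chr) and which as numbers (dbl). As a hedged illustration of how such a guess comes about, here is a tiny base R sketch on a made-up three-row table (not the real file; the column names and values are invented): a single non-numeric entry is enough to turn a whole column into text.

```r
# A made-up, tab-separated mini table (\t is a tab, \n a line break)
txt = "Year\tTitle\tPages\n2017\tA\t24\n2017\tB\t1 karba\n2018\tC\t32"

mini = read.delim(text = txt, sep = "\t")   # base R reader for delimited text
sapply(mini, class)    # Year is read as a number, Pages as text

# Converting text to numbers turns non-numeric entries into NA (with a warning)
pages = suppressWarnings(as.numeric(mini$Pages))
pages                       # 24 NA 32
mean(pages, na.rm = TRUE)   # 28: na.rm = TRUE skips the NA
```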

Exploring the data
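
The code in this section leans heavily on the pipe, so here is a small stand-alone sketch of what it does, using the built-in iris data rather than the books data. It assumes dplyr is installed (it comes with tidyverse); the object names are arbitrary.

```r
library(dplyr)

# Nested calls are read inside-out...
nested = arrange(count(iris, Species), desc(n))

# ...while the pipe passes the left-hand result on as the first argument of the next function
piped = iris %>%
  count(Species) %>%
  arrange(desc(n))

identical(nested, piped)   # TRUE: same result, different notation

# group_by() + summarise() computes one summary row per group
iris %>%
  group_by(Species) %>%
  summarise(n = n(), mean_length = mean(Sepal.Length))
```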

# Let's see what's in there:
head(books)   # <- base R vs tidyverse pipe notation:
## # A tibble: 6 x 10
##    Year Authors Title Subtitle    ISBN Placeofpublishi~ Publisher Typeofdocument
##   <dbl> <chr>   <chr> <chr>      <dbl> <chr>            <chr>     <chr>         
## 1  2017 Baeza,~ Brume <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 2  2017 Booger~ Nul   <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 3  2017 Gheluw~ Spec~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 4  2017 Lima, ~ Sutr~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 5  2017 Lacko,~ Call~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 6  2017 Pallas~ Mirr~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## # ... with 2 more variables: Language <chr>, Numberofpages <chr>
books %>% head() # pipe: take what is on the left and use as input on the right
## # A tibble: 6 x 10
##    Year Authors Title Subtitle    ISBN Placeofpublishi~ Publisher Typeofdocument
##   <dbl> <chr>   <chr> <chr>      <dbl> <chr>            <chr>     <chr>         
## 1  2017 Baeza,~ Brume <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 2  2017 Booger~ Nul   <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 3  2017 Gheluw~ Spec~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 4  2017 Lima, ~ Sutr~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 5  2017 Lacko,~ Call~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 6  2017 Pallas~ Mirr~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## # ... with 2 more variables: Language <chr>, Numberofpages <chr>
# The shortcut to insert a pipe is CTRL+SHIFT+M (PC) or CMD+SHIFT+M (Mac)

# if this gives you an error "object not found" - you need to import the data first, see code block above!

books   # it's a "tibble", an improvement over R base dataframes
## # A tibble: 1,120 x 10
##     Year Authors                 Title Subtitle    ISBN Placeofpublishi~ Publisher
##    <dbl> <chr>                   <chr> <chr>      <dbl> <chr>            <chr>    
##  1  2017 "Baeza, Amanda"         Brume <NA>     9.79e12 Riga             Grafiski~
##  2  2017 "Booger, Olive"         Nul   <NA>     9.79e12 Riga             Grafiski~
##  3  2017 "Gheluwe, Mathilde Van" Spec~ <NA>     9.79e12 Riga             Grafiski~
##  4  2017 "Lima, Daniel"          Sutr~ <NA>     9.79e12 Riga             Grafiski~
##  5  2017 "Lacko, Martin"         Call~ <NA>     9.79e12 Riga             Grafiski~
##  6  2017 "Pallasvuo, Jaakko"     Mirr~ <NA>     9.79e12 Riga             Grafiski~
##  7  2017 "Serrao, Catia "        Acqu~ <NA>     9.79e12 Riga             Grafiski~
##  8  2017 "Kandevica, Liva "      Yell~ <NA>     9.79e12 Riga             Grafiski~
##  9  2017 "Samplerman"            Bad ~ <NA>     9.79e12 Riga             Grafiski~
## 10  2017 "Ellsworth, Theo"       An e~ <NA>     9.79e12 Riga             Grafiski~
## # ... with 1,110 more rows, and 3 more variables: Typeofdocument <chr>,
## #   Language <chr>, Numberofpages <chr>
# Let's do some basic data wrangling and summarizing:

# count and rank publishers
books %>% 
  count(Publisher) %>% 
  arrange(-n)   # minus makes it descending order
## # A tibble: 79 x 2
##    Publisher                                 n
##    <chr>                                 <int>
##  1 EGMONT LATVIJA                          426
##  2 Apgads Zvaigzne ABC                     369
##  3 Grafiskie stasti                         48
##  4 Latvijas Makslas akademija               46
##  5 Daugavpils Marka Rotko makslas centrs    41
##  6 Popper                                   28
##  7 LATVIJAS MEDIJI                          24
##  8 MADRIS                                   14
##  9 RAFFA                                     9
## 10 Brangi                                    8
## # ... with 69 more rows
arrange(count(books, Publisher),desc(n)) # this would also work, but the pipes are easier to read, right?
## # A tibble: 79 x 2
##    Publisher                                 n
##    <chr>                                 <int>
##  1 EGMONT LATVIJA                          426
##  2 Apgads Zvaigzne ABC                     369
##  3 Grafiskie stasti                         48
##  4 Latvijas Makslas akademija               46
##  5 Daugavpils Marka Rotko makslas centrs    41
##  6 Popper                                   28
##  7 LATVIJAS MEDIJI                          24
##  8 MADRIS                                   14
##  9 RAFFA                                     9
## 10 Brangi                                    8
## # ... with 69 more rows
# count and rank publishing locations
books %>% 
  count(Placeofpublishing) %>% 
  arrange(-n)  
## # A tibble: 23 x 2
##    Placeofpublishing     n
##    <chr>             <int>
##  1 Riga               1038
##  2 Daugavpils           45
##  3 Garkalnes novads      8
##  4 Liepaja               5
##  5 Siguldas novads       3
##  6 Cesu novads           2
##  7 Kekavas novads        2
##  8 Ventspils             2
##  9 Alsungas novads       1
## 10 Burtnieku novads      1
## # ... with 13 more rows
# group_by, summarise
books %>% 
  group_by(Language) %>% 
  summarise(meanpages = mean(Numberofpages, na.rm=T)) 
## Warning in mean.default(Numberofpages, na.rm = T): argument is not numeric or
## logical: returning NA

## (the same warning appears 14 times in total, once per language group)
## # A tibble: 14 x 2
##    Language                         meanpages
##    <chr>                                <dbl>
##  1 Anglu                                   NA
##  2 Anglu; Igaunu; Krievu                   NA
##  3 Anglu; Latviešu                         NA
##  4 Anglu; Latviešu; Igaunu                 NA
##  5 Krievu                                  NA
##  6 Krievu; Latviešu; Anglu                 NA
##  7 Latviešu                                NA
##  8 Latviešu; Anglu                         NA
##  9 Latviešu; Anglu; Krievu                 NA
## 10 Latviešu; Francu                        NA
## 11 Latviešu; Igaunu; Anglu                 NA
## 12 Latviešu; Krievu                        NA
## 13 Latviešu; Krievu; Anglu; Kiniešu        NA
## 14 Latviešu; Zviedru; Italu                NA
# :( this doesn't seem to work, everything is NA! But it looked like the column contained numbers, so why can't we take the mean...?

# Let's investigate
class(books$Numberofpages) # aha!
## [1] "character"
# and let's fix, saving the outcome as a new object
books2 = books %>% 
  mutate(pages = as.numeric(Numberofpages)) 
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
# mutate changes and creates variables
# This will warn that some entries were turned to NA - those that contained text and could not be turned into numbers. Let's look closer:

books2 %>% 
  select(Numberofpages, pages) %>%  # select the old and new variable
  filter(is.na(pages)) # that's ok, just 4 entries
## # A tibble: 4 x 2
##   Numberofpages pages
##   <chr>         <dbl>
## 1 <NA>             NA
## 2 1 karba          NA
## 3 <NA>             NA
## 4 <NA>             NA
# Let's remove them:

books3 = books2 %>% filter(!is.na(pages))

# Now we can do the page mean!
books3 %>% 
  group_by(Language) %>% 
  summarise(n=n(), meanpages = mean(pages, na.rm=T)) %>%  # counts and means
  arrange(-meanpages)
## # A tibble: 14 x 3
##    Language                             n meanpages
##    <chr>                            <int>     <dbl>
##  1 Latviešu; Krievu                     2     144  
##  2 Latviešu; Krievu; Anglu; Kiniešu     1     100  
##  3 Anglu; Latviešu                      1      96  
##  4 Latviešu; Igaunu; Anglu              1      80  
##  5 Latviešu; Anglu                     42      42.9
##  6 Anglu                               80      42.5
##  7 Latviešu; Francu                     1      40  
##  8 Latviešu; Anglu; Krievu              7      34  
##  9 Anglu; Igaunu; Krievu                1      32  
## 10 Anglu; Latviešu; Igaunu              1      32  
## 11 Krievu; Latviešu; Anglu              1      32  
## 12 Latviešu                           976      29.1
## 13 Latviešu; Zviedru; Italu             1      28  
## 14 Krievu                               1      16
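
If the steps above went by a bit fast, here is the same problem and fix on a tiny made-up table. This is only a sketch: the `toy` data and its values are invented for illustration and are not part of the workshop dataset.

```r
library(dplyr)

# Hypothetical toy data, standing in for the books table
toy = tibble(
  Language = c("A", "A", "B", "B"),
  Numberofpages = c("10", "30", "20", "1 box")  # numbers stored as text
)

class(toy$Numberofpages)                   # a character column, like in books
suppressWarnings(mean(toy$Numberofpages))  # mean() of text gives NA (with a warning)

toy_means = toy %>% 
  mutate(pages = suppressWarnings(as.numeric(Numberofpages))) %>%  # "1 box" becomes NA
  filter(!is.na(pages)) %>%                                        # drop the unparseable row
  group_by(Language) %>% 
  summarise(n = n(), meanpages = mean(pages))

toy_means  # one row per language, now with real means
```

The point is that `as.numeric()` quietly turns anything it cannot parse into NA, so it is worth checking how many rows that affects before dropping them, as we did above.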

Let’s plot

# Why not visualize the whole distribution?
# This will only work if you ran all the code in the previous code block, otherwise you'll get an "object not found" error
ggplot(books3, aes(y=Language, x=pages)) +
  geom_boxplot()

# Some categories are very small though, just one entry. We can filter the smaller ones out:
ggplot(books3 %>% group_by(Language) %>% filter(n()>10), 
       aes(y=Language, x=pages)) +
  geom_boxplot()

# Another way to visualize distributions is using a violin density plot
ggplot(books3 %>% group_by(Language) %>% filter(n()>10), 
       aes(x=Language, y=pages)) +
  geom_violin()

# We can also combine them:
ggplot(books3 %>% group_by(Language) %>% filter(n()>10), 
       aes(x=Language, y=pages)) +
  geom_violin(color=NA, fill="gray")+
  geom_boxplot(width=0.05, fill=NA, outlier.size = 0.2)+
  theme_bw()

# help(geom_violin) # if you're interested


# We could also look into publishers:
ggplot(books3 %>% group_by(Publisher) %>% filter(n()>10),  # again removing the smaller ones
       aes(y=Publisher)) +
  geom_bar()

# Or publication types
ggplot(books3, aes(y=Typeofdocument)) +
  geom_bar()

# Notice that variable is quite messy!

# We could try to fix it a bit
ggplot(books3 %>% mutate(Typeofdocument=tolower(Typeofdocument)), 
       aes(y=Typeofdocument)) +
  geom_bar()+
  theme(axis.title.y = element_blank())  # switches off theme elements
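
The `group_by(...) %>% filter(n()>10)` pattern used in several of the plots above may look a bit magical. Here is a minimal sketch of what it does, on invented toy data (the table and the threshold of 2 are assumptions for illustration): once the data is grouped, `n()` counts the rows within each group, so the filter keeps or drops whole groups.

```r
library(dplyr)

# Hypothetical toy data: 5 "lv" books, 3 "en" books, 1 "ru" book
toy = tibble(
  Language = c(rep("lv", 5), rep("en", 3), "ru"),
  pages = c(24, 32, 16, 48, 24, 40, 96, 32, 16)
)

toy_big = toy %>% 
  group_by(Language) %>%  # n() is now counted within each language
  filter(n() > 2) %>%     # keep only rows that belong to groups with more than 2 rows
  ungroup()               # drop the grouping again, so later steps are not affected

toy_big %>% count(Language)  # "ru" is gone, the other two groups are intact
```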

Exercise 3

More plots!

# for this to work, you need to have imported the data; if you didn't do this above, do it now:
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt")
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a new object:
publishers = books %>% 
  mutate(pages=suppressWarnings(as.numeric(Numberofpages))) %>% 
  group_by(Publisher) %>% 
  filter(n()>10, !is.na(pages)) 
# suppressWarnings() hides the "NAs introduced by coercion" warning that as.numeric() would otherwise print

publishers[1,] # quick look
## # A tibble: 1 x 11
## # Groups:   Publisher [1]
##    Year Authors       Title Subtitle    ISBN Placeofpublishi~ Publisher Typeofdocument
##   <dbl> <chr>         <chr> <chr>      <dbl> <chr>            <chr>     <chr>         
## 1  2017 Baeza, Amanda Brume <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## # ... with 3 more variables: Language <chr>, Numberofpages <chr>, pages <dbl>
# Template:

publishers2 = publishers # %>%  # uncomment the pipe and start adding steps here

Exercises:

  • Let’s create a plot depicting the volume and median book length for each publisher; take the publishers object and create two new variables for plotting:
      • first: group_by(Publisher)
      • then use the summarise() function to create these variables: medianpages=median(pages), nbooks=n()

# then when you're done, plot:
ggplot(publishers2, 
       aes(x=medianpages, y=nbooks, label=Publisher)) +
  geom_point()

Exercises:

  • This plots the publishers but we don’t know which is which. Either color the dots by publisher, or add labels to the plot using geom_text() - if the latter, you might want to use the hjust=0 parameter for better text positioning.
  • Add nicer x and y axis labels and a title using labs()
  • Discuss how the publishers differ with the person sitting next to you.
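
If you are unsure how `geom_text()` and `labs()` fit into a plot, here is a small sketch on made-up numbers, so it does not give away the exercise. The publisher names, values and label texts are all invented; `nudge_x` is an optional extra I added to push the labels slightly away from the dots.

```r
library(ggplot2)

# Hypothetical toy data, not the real publishers
toy = data.frame(
  medianpages = c(24, 32, 96),
  nbooks      = c(400, 350, 20),
  Publisher   = c("Publisher A", "Publisher B", "Publisher C")
)

p = ggplot(toy, aes(x = medianpages, y = nbooks, label = Publisher)) +
  geom_point() +
  geom_text(hjust = 0, nudge_x = 1) +  # hjust=0 left-aligns the label, so it starts at the dot
  labs(x = "Median number of pages", 
       y = "Number of books", 
       title = "Toy example: volume vs book length")

p
```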

When done, paste your plot to https://hackmd.io/@andreskarjus/SkTrzBZgK/edit

Wrangling messy data

library(tidyr) # part of tidyverse; provides separate_rows and separate
library(ggbeeswarm) # provides an additional geom for ggplot2

# for this to work, you need to have imported the data; if you didn't do this above, do it now:
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt") # only do it if you didn't yet before
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
books %>% select(Authors) %>% slice(18) # problem, multiple authors on the same row (multi-author books)
## # A tibble: 1 x 1
##   Authors                       
##   <chr>                         
## 1 Bulling, Paula; Hoffmann, Nina
# Let's split:
book_authors = books %>% 
  mutate(pages=suppressWarnings(as.numeric(Numberofpages))) %>% 
  separate_rows(Authors, sep = "; ")

nrow(books)
## [1] 1120
nrow(book_authors) # ok so not that many of them, but now we can do other operations
## [1] 1137
book_authors %>% slice(18:19) 
## # A tibble: 2 x 11
##    Year Authors        Title Subtitle    ISBN Placeofpublishi~ Publisher Typeofdocument
##   <dbl> <chr>          <chr> <chr>      <dbl> <chr>            <chr>     <chr>         
## 1  2017 Bulling, Paula Shar~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## 2  2017 Hoffmann, Nina Shar~ <NA>     9.79e12 Riga             Grafiski~ Komiksi       
## # ... with 3 more variables: Language <chr>, Numberofpages <chr>, pages <dbl>
# Let's further split first and last names
authors_split = book_authors %>% 
  separate(Authors, into = c("Lastname", "Firstname"), sep = ", ", remove = F)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 13 rows [9, 12,
## 17, 25, 39, 54, 304, 305, 609, 757, 947, 948, 1010].
authors_split[9, ] # the warning comes from entries with a single name only (no ", " to split on)
## # A tibble: 1 x 13
##    Year Authors    Lastname   Firstname Title    Subtitle          ISBN Placeofpublishi~
##   <dbl> <chr>      <chr>      <chr>     <chr>    <chr>            <dbl> <chr>           
## 1  2017 Samplerman Samplerman <NA>      Bad Ball <NA>     9789934518591 Riga            
## # ... with 5 more variables: Publisher <chr>, Typeofdocument <chr>,
## #   Language <chr>, Numberofpages <chr>, pages <dbl>
authors_split # %>%   # Mini-exercise: how to count rows with NA first name?
## # A tibble: 1,137 x 13
##     Year Authors   Lastname  Firstname  Title  Subtitle    ISBN Placeofpublishi~
##    <dbl> <chr>     <chr>     <chr>      <chr>  <chr>      <dbl> <chr>           
##  1  2017 "Baeza, ~ Baeza     "Amanda"   Brume  <NA>     9.79e12 Riga            
##  2  2017 "Booger,~ Booger    "Olive"    Nul    <NA>     9.79e12 Riga            
##  3  2017 "Gheluwe~ Gheluwe   "Mathilde~ Spect~ <NA>     9.79e12 Riga            
##  4  2017 "Lima, D~ Lima      "Daniel"   Sutra~ <NA>     9.79e12 Riga            
##  5  2017 "Lacko, ~ Lacko     "Martin"   Call ~ <NA>     9.79e12 Riga            
##  6  2017 "Pallasv~ Pallasvuo "Jaakko"   Mirro~ <NA>     9.79e12 Riga            
##  7  2017 "Serrao,~ Serrao    "Catia "   Acqui~ <NA>     9.79e12 Riga            
##  8  2017 "Kandevi~ Kandevica "Liva "    Yellow <NA>     9.79e12 Riga            
##  9  2017 "Sampler~ Samplerm~  <NA>      Bad B~ <NA>     9.79e12 Riga            
## 10  2017 "Ellswor~ Ellsworth "Theo"     An ex~ <NA>     9.79e12 Riga            
## # ... with 1,127 more rows, and 5 more variables: Publisher <chr>,
## #   Typeofdocument <chr>, Language <chr>, Numberofpages <chr>, pages <dbl>
authors_split # %>%   # Mini-exercise: how to count and rank the first names
## # A tibble: 1,137 x 13
##     Year Authors   Lastname  Firstname  Title  Subtitle    ISBN Placeofpublishi~
##    <dbl> <chr>     <chr>     <chr>      <chr>  <chr>      <dbl> <chr>           
##  1  2017 "Baeza, ~ Baeza     "Amanda"   Brume  <NA>     9.79e12 Riga            
##  2  2017 "Booger,~ Booger    "Olive"    Nul    <NA>     9.79e12 Riga            
##  3  2017 "Gheluwe~ Gheluwe   "Mathilde~ Spect~ <NA>     9.79e12 Riga            
##  4  2017 "Lima, D~ Lima      "Daniel"   Sutra~ <NA>     9.79e12 Riga            
##  5  2017 "Lacko, ~ Lacko     "Martin"   Call ~ <NA>     9.79e12 Riga            
##  6  2017 "Pallasv~ Pallasvuo "Jaakko"   Mirro~ <NA>     9.79e12 Riga            
##  7  2017 "Serrao,~ Serrao    "Catia "   Acqui~ <NA>     9.79e12 Riga            
##  8  2017 "Kandevi~ Kandevica "Liva "    Yellow <NA>     9.79e12 Riga            
##  9  2017 "Sampler~ Samplerm~  <NA>      Bad B~ <NA>     9.79e12 Riga            
## 10  2017 "Ellswor~ Ellsworth "Theo"     An ex~ <NA>     9.79e12 Riga            
## # ... with 1,127 more rows, and 5 more variables: Publisher <chr>,
## #   Typeofdocument <chr>, Language <chr>, Numberofpages <chr>, pages <dbl>
# Ok that's interesting... maybe we can quantify this, knowing a bit about how Latvian names work...
authorgender = authors_split %>% 
  mutate(lastletter = substr(Firstname, nchar(Firstname), nchar(Firstname)),
         genderguess = case_when(
                      is.na(Firstname) ~ NA_character_, # if no name, can't guess
           lastletter %in% c("a", "e") ~ "F",   # if a or e, then F,
                                     T ~ "M")   # otherwise M
         )
authorgender %>% count(genderguess)
## # A tibble: 3 x 2
##   genderguess     n
##   <chr>       <int>
## 1 F             158
## 2 M              67
## 3 <NA>          912
# Let's plot; also let's try a new geom, the beeswarm
ggplot(authorgender %>% filter(!is.na(genderguess)), 
       aes(x=genderguess, y=pages  ))+
  geom_beeswarm()+
  scale_y_continuous(limits=c(0,200))
## Warning: Removed 4 rows containing missing values (position_beeswarm).

# Mini-exercise: how could we color the dots by year?
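
One thing to watch out for in the last-letter trick above: some first names in the data have a trailing space (e.g. "Catia " in the printout), so their "last letter" is actually a space. A minimal sketch of the same splitting steps on three invented rows, with `trimws()` added as a possible fix, could look like this. The `fill = "right"` argument is my addition to silence the "Expected 2 pieces" warning; it is not what the workshop code uses.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data mimicking the Authors column
toy = tibble(Authors = c("Bulling, Paula; Hoffmann, Nina", "Serrao, Catia ", "Samplerman"))

toy_split = toy %>% 
  separate_rows(Authors, sep = "; ") %>%              # one author per row
  separate(Authors, into = c("Lastname", "Firstname"), 
           sep = ", ", remove = FALSE, fill = "right") %>%  # single names get NA as first name
  mutate(Firstname  = trimws(Firstname),              # strip leading/trailing spaces
         lastletter = substr(Firstname, nchar(Firstname), nchar(Firstname)))

toy_split
```

Without the `trimws()` step, the last letter of "Catia " would be " " rather than "a", and the guess would come out differently for that row.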

Exercise 4

Remember the language variable? It also had multiple values per row.

  • Do the same here that we did with names: split multi-language entries across separate rows using separate_rows().
  • Save the results in a new object (pick a name yourself), then use ggplot to visualize something interesting about the distribution of languages in the dataset.

books %>% count(Language) # this variable
## # A tibble: 14 x 2
##    Language                             n
##    <chr>                            <int>
##  1 Anglu                               80
##  2 Anglu; Igaunu; Krievu                1
##  3 Anglu; Latviešu                      1
##  4 Anglu; Latviešu; Igaunu              1
##  5 Krievu                               1
##  6 Krievu; Latviešu; Anglu              1
##  7 Latviešu                           980
##  8 Latviešu; Anglu                     42
##  9 Latviešu; Anglu; Krievu              7
## 10 Latviešu; Francu                     1
## 11 Latviešu; Igaunu; Anglu              1
## 12 Latviešu; Krievu                     2
## 13 Latviešu; Krievu; Anglu; Kiniešu     1
## 14 Latviešu; Zviedru; Italu             1
# Template: start adding to this pipeline. Save the results as a new object using =
books %>% 
  mutate(pages=suppressWarnings(as.numeric(Numberofpages))) # %>% 
## # A tibble: 1,120 x 11
##     Year Authors                 Title Subtitle    ISBN Placeofpublishi~ Publisher
##    <dbl> <chr>                   <chr> <chr>      <dbl> <chr>            <chr>    
##  1  2017 "Baeza, Amanda"         Brume <NA>     9.79e12 Riga             Grafiski~
##  2  2017 "Booger, Olive"         Nul   <NA>     9.79e12 Riga             Grafiski~
##  3  2017 "Gheluwe, Mathilde Van" Spec~ <NA>     9.79e12 Riga             Grafiski~
##  4  2017 "Lima, Daniel"          Sutr~ <NA>     9.79e12 Riga             Grafiski~
##  5  2017 "Lacko, Martin"         Call~ <NA>     9.79e12 Riga             Grafiski~
##  6  2017 "Pallasvuo, Jaakko"     Mirr~ <NA>     9.79e12 Riga             Grafiski~
##  7  2017 "Serrao, Catia "        Acqu~ <NA>     9.79e12 Riga             Grafiski~
##  8  2017 "Kandevica, Liva "      Yell~ <NA>     9.79e12 Riga             Grafiski~
##  9  2017 "Samplerman"            Bad ~ <NA>     9.79e12 Riga             Grafiski~
## 10  2017 "Ellsworth, Theo"       An e~ <NA>     9.79e12 Riga             Grafiski~
## # ... with 1,110 more rows, and 4 more variables: Typeofdocument <chr>,
## #   Language <chr>, Numberofpages <chr>, pages <dbl>

When done, paste your plot to https://hackmd.io/@andreskarjus/SkTrzBZgK/edit

Words

library(quanteda)
library(quanteda.textplots)

lvstop = read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-lv/master/raw/ranksnl-latvian.txt") # download some Latvian stopwords

titles_dfm = books$Title %>% 
  tokens(remove_punct = T, remove_symbols = T, remove_numbers = T) %>% 
  tokens_tolower() %>% 
  tokens_remove(c(stopwords("en"), lvstop)) %>% 
  dfm()

titles_dfm # have a look - it's a document-feature matrix (also known as a document-term matrix)
## Document-feature matrix of: 1,120 documents, 1,102 features (99.80% sparse) and 0 docvars.
##        features
## docs    brume nul spectacular vermacular sutrama call cthulhu mirror stage
##   text1     1   0           0          0       0    0       0      0     0
##   text2     0   1           0          0       0    0       0      0     0
##   text3     0   0           1          1       0    0       0      0     0
##   text4     0   0           0          0       1    0       0      0     0
##   text5     0   0           0          0       0    1       1      0     0
##   text6     0   0           0          0       0    0       0      1     1
##        features
## docs    acquisition
##   text1           0
##   text2           0
##   text3           0
##   text4           0
##   text5           0
##   text6           0
## [ reached max_ndoc ... 1,114 more documents, reached max_nfeat ... 1,092 more features ]
topfeatures(titles_dfm) # pull most frequent words
##     kraso     sirds dzivnieki    little      pony      vagi     ledus      maša 
##        27        26        26        24        24        24        23        22 
##     lacis princeses 
##        22        21
textplot_wordcloud(titles_dfm, 
                   min_size = 2, max_size = 10, 
                   color=terrain.colors(10) ) %>% suppressWarnings()

# adjust the size variables if it looks too small or big on your screen

# This is definitely not a very scientific visualization, but can serve as a nice illustration once in a while.
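
To see what the tokenizing pipeline above actually does, it can help to run it on a few short strings where you can check the result by eye. This is just a sketch with three invented titles; it only uses English stopwords and writes the steps one by one instead of piping.

```r
library(quanteda)

# Hypothetical toy titles
toy_titles = c("The Little Pony", "Little pony and the bear", "Bear, 2 stories!")

toks = tokens(toy_titles, remove_punct = TRUE, remove_numbers = TRUE)  # split into words
toks = tokens_tolower(toks)                                           # "Pony" and "pony" become the same word
toks = tokens_remove(toks, stopwords("en"))                           # drop "the", "and", etc.
toy_dfm = dfm(toks)                                                   # rows = titles, columns = words

toy_dfm
topfeatures(toy_dfm)  # most frequent words across all titles
```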

Making things interactive

library(plotly)    # for doing interactive plots
# plotly can be used to create the same sorts of plots as you've done with the ggplot() function, except interactive. 
# It can be used to create interactive plots from scratch, or to convert (most) ggplots. 

# for this to work, you need to have imported the data; if you didn't do this above, do it now:
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt") # only do it if you didn't yet before
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# As an example, let's calculate the median book length for top publishers over the years
pagesyears = books %>% 
  mutate(pages=suppressWarnings(as.numeric(Numberofpages))) %>% 
  group_by(Publisher) %>% 
  filter(n()>10) %>%
  group_by(Publisher, Year) %>%
  summarise(nbooks=n(), medianpages=median(pages, na.rm=T))
## `summarise()` has grouped output by 'Publisher'. You can override using the `.groups` argument.
g = ggplot(pagesyears, aes(x=Year, y=medianpages, color=Publisher))+
  geom_line()+
  geom_point(aes(size=nbooks))

g # have a look

ggplotly(g) # convert to interactive
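
A side note on the `summarise()` message above ("has grouped output by 'Publisher'"): it is harmless, but it does mean the result is still grouped. A small sketch on invented data shows how the `.groups` argument mentioned in the message can be used to get a plain, ungrouped table instead.

```r
library(dplyr)

# Hypothetical toy data
toy = tibble(
  Publisher = c("A", "A", "A", "B"),
  Year      = c(2017, 2017, 2018, 2017),
  pages     = c(10, 20, 30, 40)
)

toy_sum = toy %>% 
  group_by(Publisher, Year) %>% 
  summarise(nbooks = n(), 
            medianpages = median(pages, na.rm = TRUE), 
            .groups = "drop")  # return an ungrouped tibble, and no message

toy_sum
```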

Heatmaps

Sometimes it is useful to see how different variables interact.

# for this to work, you need to have imported the data; if you didn't do this above, do it now:
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt") # only do it if you didn't yet before
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Let's see how genre and publisher interact

genrelang = books %>% 
  separate_rows(Language, sep="; ") %>% 
  mutate(Genre = tolower(Typeofdocument)) %>% 
  group_by(Publisher) %>% 
  filter(n()>5) %>%         # remove the very small ones
  group_by(Publisher, Genre) %>%   # regroup to count interactions
  count()
genrelang # have a look
## # A tibble: 23 x 3
## # Groups:   Publisher, Genre [23]
##    Publisher                             Genre                                 n
##    <chr>                                 <chr>                             <int>
##  1 Apgads Zvaigzne ABC                   komiksi                               3
##  2 Apgads Zvaigzne ABC                   krasojamas gramatas pieaugušajie~     6
##  3 Apgads Zvaigzne ABC                   makslas albumi                        1
##  4 Apgads Zvaigzne ABC                   makslas katalogi                      1
##  5 Apgads Zvaigzne ABC                   praktiska, attistoša literatura ~   358
##  6 Brangi                                praktiska, attistoša literatura ~     8
##  7 Daugavpils Marka Rotko makslas centrs makslas izstažu katalogi             81
##  8 Daugavpils Marka Rotko makslas centrs makslas katalogi                     10
##  9 EGMONT LATVIJA                        komiksi                               1
## 10 EGMONT LATVIJA                        praktiska, attistoša literatura ~   425
## # ... with 13 more rows
ggplot(genrelang, aes(x=Genre, y=Publisher, fill=n))+
  geom_tile(color="black")+
  
  NULL

Exercises

  • You can’t really see the x axis values; set a better angle by adding: theme(axis.text.x = element_text(angle=45,hjust=1,vjust=1)) - and always make sure theme() comes after commands like theme_bw()
  • Since there are not that many cells, we could add the counts directly to the plot, so it becomes like a colored table of sorts: add geom_text(aes(label=n), color="white") and remove the now useless fill legend - either by specifying guide="none" in the scale_fill function, or by specifying legend.position="none" using theme()
  • The color palette is skewed by the few very large values, which makes the differences between smaller values harder to see. One way to fix it would be to use a logarithmic transformation, which is easy to do directly in the color fill function; add scale_fill_continuous(trans="log")
  • The grid is a bit useless here, and since some cells are empty, it shows through. You can remove it using theme(panel.grid=element_blank())
  • Inspect the visualization and discuss it if you’re doing this sitting together with somebody. Obviously in this example the heatmap is quite small, but this approach can be very useful to get an overview of larger datasets.
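
If you get stuck on how the pieces from the list above fit together, here is one possible way to stack them, shown on a tiny invented table rather than the real data. The genre and publisher names and the counts are made up, and `theme_bw()` is only there to illustrate the ordering rule.

```r
library(ggplot2)

# Hypothetical toy counts
toy = data.frame(
  Genre     = c("comics", "comics", "catalogues", "albums"),
  Publisher = c("A", "B", "B", "A"),
  n         = c(3, 400, 10, 1)
)

p = ggplot(toy, aes(x = Genre, y = Publisher, fill = n)) +
  geom_tile(color = "black") +
  geom_text(aes(label = n), color = "white") +          # counts written on the tiles
  scale_fill_continuous(trans = "log", guide = "none") + # log color scale, no legend
  theme_bw() +                                           # complete theme first...
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        panel.grid  = element_blank())                   # ...then the individual tweaks

p
```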

When done, paste your plot to https://hackmd.io/@andreskarjus/SkTrzBZgK/edit

Networks

library(visNetwork)
# for this to work, you need to have imported the data; if you didn't do this above, do it now:
books = read_delim("https://raw.githubusercontent.com/andreskarjus/artofthefigure/master/riga2022/publications_eng.txt") # only do it if you didn't yet before
## Rows: 1120 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (8): Authors, Title, Subtitle, Placeofpublishing, Publisher, Typeofdocum...
## dbl (2): Year, ISBN
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# recreate the split authors variable; this time let's also remove all spaces from the names (so "Baeza, Amanda" becomes "Baeza,Amanda"), drop rows with no author, and filter out single-publication publishers
book_authors2 = books %>% 
  separate_rows(Authors, sep = "; +") %>% 
  mutate(Authors = gsub(" +", "", Authors)) %>% 
  filter(!is.na(Authors)) %>% 
  group_by(Publisher) %>% filter(n()>1) %>% ungroup()

book_authors2 %>% select(Authors, Publisher) # how could we visualize these connections?
## # A tibble: 219 x 2
##    Authors             Publisher       
##    <chr>               <chr>           
##  1 Baeza,Amanda        Grafiskie stasti
##  2 Booger,Olive        Grafiskie stasti
##  3 Gheluwe,MathildeVan Grafiskie stasti
##  4 Lima,Daniel         Grafiskie stasti
##  5 Lacko,Martin        Grafiskie stasti
##  6 Pallasvuo,Jaakko    Grafiskie stasti
##  7 Serrao,Catia        Grafiskie stasti
##  8 Kandevica,Liva      Grafiskie stasti
##  9 Samplerman          Grafiskie stasti
## 10 Ellsworth,Theo      Grafiskie stasti
## # ... with 209 more rows
edges = book_authors2 %>% 
  rename(from=Authors, to=Publisher) %>% 
  group_by(from, to) %>% 
  count(name="size")

nodes = rbind(
  book_authors2 %>% 
    count(Authors, name="size") %>% 
    mutate(size=(log(size)*7)+5) %>% 
    rename(id=Authors) %>% 
    mutate(group="Authors")
  ,
  book_authors2 %>% 
    rename(id=Publisher) %>% 
    mutate(size=10, group="Publishers") %>% 
    select(id, size, group) %>% 
    distinct() 
) %>% mutate(label=id)

edges; nodes # quick look at structure
## # A tibble: 160 x 3
## # Groups:   from, to [160]
##    from                      to                          size
##    <chr>                     <chr>                      <int>
##  1 Albinš,Ugis               Popper                         1
##  2 Andersone,Gita            Apgads Zvaigzne ABC            3
##  3 Androutsopoulos,Evangelos Grafiskie stasti               1
##  4 Arne,Kerija               Popper                         1
##  5 Ava,GeorgsHarijs          Latvijas Makslas akademija     1
##  6 Baceviciene,Neringa       Apgads Zvaigzne ABC            1
##  7 Baeza,Amanda              Grafiskie stasti               1
##  8 Baiba,Baiba               Popper                         1
##  9 Bell,Marc                 Grafiskie stasti               1
## 10 Benša,Katarina            LATVIJAS MEDIJI                1
## # ... with 150 more rows
## # A tibble: 175 x 4
##    id                         size group   label                    
##    <chr>                     <dbl> <chr>   <chr>                    
##  1 Albinš,Ugis                 5   Authors Albinš,Ugis              
##  2 Andersone,Gita             12.7 Authors Andersone,Gita           
##  3 Androutsopoulos,Evangelos   5   Authors Androutsopoulos,Evangelos
##  4 Arne,Kerija                 5   Authors Arne,Kerija              
##  5 Ava,GeorgsHarijs            5   Authors Ava,GeorgsHarijs         
##  6 Baceviciene,Neringa         5   Authors Baceviciene,Neringa      
##  7 Baeza,Amanda                5   Authors Baeza,Amanda             
##  8 Baiba,Baiba                 5   Authors Baiba,Baiba              
##  9 Bell,Marc                   5   Authors Bell,Marc                
## 10 Benša,Katarina              5   Authors Benša,Katarina           
## # ... with 165 more rows
visNetwork(nodes, edges)
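
The data preparation above is fairly long, so here is a stripped-down sketch of what `visNetwork()` needs as input: a nodes table with an `id` column and an edges table with `from` and `to` columns that refer to those ids. The names and sizes below are invented for illustration.

```r
library(visNetwork)

# Hypothetical toy network: two authors connected to one publisher
nodes = data.frame(
  id    = c("Author A", "Author B", "Publisher X"),
  label = c("Author A", "Author B", "Publisher X"),  # text shown next to each node
  group = c("Authors", "Authors", "Publishers"),      # groups get different colors
  size  = c(5, 12, 10)
)
edges = data.frame(
  from = c("Author A", "Author B"),
  to   = c("Publisher X", "Publisher X")
)

# every edge endpoint should exist as a node id
all(c(edges$from, edges$to) %in% nodes$id)

net = visNetwork(nodes, edges)
net
```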

——–

Various exercises from previous workshops (if we have time left over)

Animations

It’s also easy to generate animations in R, either with gganimate (produces gifs out of ggplots) or with plotly. With the latter, the ggplotly command can also produce animations, provided you specify the variable for the “frame” (and “ids”) parameter in ggplot, which will then be passed on to plotly (of course you can construct animations using the native plotly syntax as well, but that syntax is a bit different, so we won’t go into that today).

g = ggplot(gapminder,   # first let's create and save a ggplot
           aes(gdpPercap, lifeExp, 
               label=country,
               color = continent,
               size = pop,
               frame = year, ids = country   # these 2 parameters make the animation work!
               )) +
  geom_point(alpha=0.5) +    # this stuff should be somewhat familiar by now
  geom_text(size=2, color="black", alpha=0.3)+
  scale_x_continuous(limits=c(0,50000))+  # I'll exclude some outliers
  theme_minimal()+
  NULL

ggplotly(g) %>% style(textposition = "right")

Maps

Mapping is pretty easy too, we can just use ggplot. If doing more advanced mapping, have a look at the sf and leaflet packages. We’re plotting a very basic map here, but you could easily import more detailed polygon data to plot e.g. regional, dialectal or historical maps.

library(rworldmap)  # provides generic world map
library(maps)       # provides a dataset
library(shadowtext) # nicer text labels for ggplot

# This pulls a simple map of Latvia, and some data on settlements from a global dataset.
newmap = joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10", mapResolution = "low") %>% 
  fortify() %>%              # convert the map polygons to a data frame ggplot can draw
  subset(id=="Latvia")
places = world.cities %>% filter(country.etc == "Latvia") %>% arrange(desc(pop))

ggplot(places,   # places is defined as the main dataset, for point and label locations
       aes(long, lat, 
           size=pop,   # size and color by population
           color=pop, 
           label=name)
       ) + 
  geom_polygon(data=newmap,  # the map is used as data for the polygons
               aes(long, lat, group = group), inherit.aes = F,
               color="black", fill="gray98", alpha=0.7) +
  geom_point()+
  geom_shadowtext(data=places %>% filter(pop>20000), 
                  hjust=1.1, vjust=-0.2,size=3, bg.color="white")+
  scale_size(range=c(0.5,10))+  # scale for the points
  coord_fixed()+
  theme_minimal()+
  theme(legend.position = "none", axis.title = element_blank())+
  NULL

Or if you’d like to create interactive 3D maps and terrain models, look into the rayshader package. I’ve copied a little example here, but it won’t work unless you have the rayshader and raster packages installed (if you want to give it a try, install and load them later yourself).

install.packages("rayshader") # this requires an extra package to be installed
library(rayshader)  # package for shaded terrain models

# 2D map:
map2d = montereybay  %>%
  sphere_shade(texture = "desert")  %>%
  add_shadow(ray_shade(montereybay, sunaltitude=20, zscale=50),
             max_darken=0.1)
plot_map(map2d)

# 3D map (this may take a moment to render!)
map3d = montereybay  %>%
  sphere_shade(texture = "desert") %>%
  add_shadow(ambient_shade(montereybay), 0) %>%
  add_shadow(ray_shade(montereybay, zscale = 1, sunaltitude = 89),max_darken=0.7)

plot_3d(map3d, montereybay, 
        water = TRUE, waterdepth = -90,
        watercolor="imhof2", waterlinecolor="white", waterlinealpha=0.3,
        zscale = 50,              # elevation exaggeration
        fov = 0, theta = 135, zoom = 0.5, phi = 45, 
        windowsize = c(x=30, y=10, w=1200, h=700), # adjust window size if needed
        triangulate=T, max_error = 0.1) # lower values = sharper plot (slower!), set triangulate=F to disable optimizer

We looked into the plotly package, which makes interactive plots. We didn’t try it, but plotly can also make 3D plots; these require its own native syntax, which we don’t have time to cover today.

plot_ly(data=gapminder %>% filter(year==2007, continent!="Asia"),  
        x=~log10(pop), y=~lifeExp, z=~gdpPercap, 
        color=~continent,
        type="scatter3d", mode="markers", 
        marker=list(opacity=0.6, size=5) 
        ) %>% 
  layout(xaxis = list(type = "log"))

Making websites and slides from R

We’ve looked into a few ways of making plots, including interactive plots. How can we use them in presentations and publications? One way is to just click the export button on the Plots (or Viewer) tab on the right, or for ggplots, use the ggsave() command. But we can also create webpages (or html-slides) with data and plots embedded.
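
For the `ggsave()` route, a minimal sketch could look like the following. The file name, size and resolution are arbitrary choices for illustration, and the plot uses the built-in `mtcars` data so the block runs on its own; saving into `tempdir()` just avoids cluttering your working directory.

```r
library(ggplot2)

p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

outfile = file.path(tempdir(), "example_plot.png")  # hypothetical file name
ggsave(outfile, plot = p, width = 6, height = 4, dpi = 300)  # width and height are in inches by default

file.exists(outfile)  # check that the file was written
```

The file type is picked from the extension, so changing ".png" to ".pdf" would save a vector graphic instead.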

Try it: go to File -> New File -> RMarkdown -> leave the default, “Document” selected, and press OK. This will open a new tab in the script pane, with another R markdown document (this worksheet is also an R markdown document). It has some sample content in it - just click the blue “Knit” button up top to render it as a webpage. You could also try adding your own content - but remember to load all data and packages. Copy-paste this block there and knit again:

library(gapminder)
library(ggplot2)
library(plotly)

g = ggplot(gapminder, aes(x=lifeExp, y=pop, color=continent, text=country))+
  geom_point()+
  theme_minimal()+                # add theme_*() before theme()
  theme(legend.position = "top")+ # add after theme_* if one is present
  NULL
g+labs(title="A static plot")

ggplotly(g+labs(title="An interactive plot"))




Final words on attributions, citing and references.

A word on R and its packages. It’s all free open-source software, meaning countless people have invested a lot of their own time into making this possible. If you use R, do cite it in your work (use the handy citation() command in R to get an up to date reference, both in plain text and BibTeX). To cite a package, use citation("package name"). You are also absolutely welcome to use any piece of code from this workshop, but I would likewise appreciate some acknowledgement of that:)
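
The citation commands mentioned above look like this in practice. The package name "stats" is just an example of something that is always installed; swap in whichever package you actually used.

```r
cit = citation()   # reference for R itself
cit

toBibtex(cit)      # the same reference as a BibTeX entry

citation("stats")  # reference for a specific package
```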


Do play around with these exercises later when you have time, and look into the bonus sections below for extras. If you get stuck, Google is your friend; also, check out www.stackoverflow.com - this site is a goldmine of programming (including R) questions and solutions.

Also, if you are looking for consulting on data analysis and visualization, or info on upcoming workshops, take a look at my website: https://andreskarjus.github.io/. If you’d like to stay updated, keep an eye on my Twitter @AndresKarjus (for science content) and @aRtofdataviz (for R, dataviz and workshop-related stuff).